Information Extraction from Historical Semi-Structured Handwritten Documents
Abstract
In this paper, we describe our approach to extracting salient events, such as birth and death records, from historical French parish documents that contain free-form handwritten text. The challenges these documents pose to the current state of the art in handwriting recognition and information extraction go well beyond the generic challenges of recognizing handwritten text, such as style variations, irregular baselines, and poor legibility. Our approach has the following processing steps: (1) preprocessing for noise removal and high-quality binarization, (2) OCR for text recognition, and (3) statistical information extraction for event record extraction. We focus on preprocessing techniques for robust binarization in the presence of the different types of degradation that are common in historical documents. We provide a detailed description of our system, experimental setup, and results for each stage of processing. In addition, we compare different preprocessing approaches by assessing their impact on OCR performance.
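To make the binarization step concrete, here is a minimal sketch of one common baseline, global Otsu thresholding, in plain Python. It is an illustration only: the abstract does not say which binarization methods the authors use or compare, and the function names `otsu_threshold` and `binarize` are hypothetical. Degraded historical pages often need locally adaptive methods rather than a single global threshold.

```python
def otsu_threshold(pixels):
    """Return the gray level (0..255) that maximizes between-class variance.

    `pixels` is a flat list of integer gray levels. This is the textbook
    Otsu method, used here as an assumed stand-in, not the paper's method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold=None):
    """Map a grayscale image (list of rows) to 1 for ink and 0 for paper."""
    if threshold is None:
        threshold = otsu_threshold([p for row in image for p in row])
    # Dark pixels (at or below the threshold) are treated as ink.
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny synthetic "page": three dark ink pixels on a light background.
page = [
    [200, 210, 30],
    [205, 25, 220],
    [20, 215, 210],
]
mask = binarize(page)
```

In a pipeline like the one described, a mask of this kind would be the input handed to the OCR stage, which is why binarization quality can be assessed through its effect on OCR performance.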
Similar resources
Survey on Information Extraction from Chemical Compound Literatures: Techniques and Challenges
Chemical documents, especially those involving drug information, comprise a variety of types – the most common being journal articles, patents and theses. They typically contain large amounts of chemical information, such as PubMed-ID, activity classes and adverse or side effects. Techniques are used to extract information from a huge number of documents and it is presented in a useful structur...
Web Information Extraction
Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). We...
Populating Ontologies with Data from OCRed Lists
A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...
Populating Ontologies by Semi-automatically Inducing Information Extraction Wrappers for Lists in OCRed Documents
A flexible, accurate, and efficient method of extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine queryable, linkable, and editable. But, to work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose a wrapper-induction solution for...
Space characters in Chinese semi-structured texts
Space characters can have an important role in disambiguating text. However, few, if any, Chinese information extraction systems make full use of space characters. However, it seems that treatment of space characters is necessary, especially in cases of extracting information from semi-structured documents. This investigation aims to address the importance of space characters in Chinese informa...